Add PTX vector memory intrinsics #4

Closed
ilehtoranta wants to merge 1 commit into LostBeard:master from ilehtoranta:codex/ptx-vector-memory-intrinsics

@ilehtoranta

Summary

Adds PTX-only vector memory intrinsics for explicit f32 vector load/store code generation.

This introduces:

  • PTXMemory.LoadF32x2 / StoreF32x2
  • PTXMemory.LoadF32x4 / StoreF32x4
  • Float2 and Float4 helper structs
  • intrinsic registration in the PTX algorithms context
  • aligned/vectorized ArrayView convenience helpers

The main use case is CUDA kernels that need predictable vector memory instructions instead of relying on backend inference from ordinary scalar or struct access patterns.

Details

The new PTX intrinsics generate explicit PTX vector memory operations:

  • ld.v2.f32
  • st.v2.f32
  • ld.v4.f32
  • st.v4.f32

For f32x4, ptxas can lower these to 128-bit global memory instructions such as LD.E.128 and ST.E.128 when the address is 16-byte aligned (the natural alignment of a four-float vector) and the addressing is suitable.

This is useful for performance-sensitive kernels that operate on adjacent float values.
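
The alignment requirement behind a 128-bit access can be sketched on the host in plain C++. This is only an illustration of the precondition, not the ILGPU or PTX API: `Float4`, `is_aligned_16`, `load_f32x4`, and `store_f32x4` are hypothetical names chosen for this example, and the sketch assumes the element index is a multiple of four.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte-aligned bundle of four floats (illustration only).
struct alignas(16) Float4 {
    float x, y, z, w;
};

// True when `p` sits on a 16-byte boundary, the precondition for one 128-bit access.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Reads four adjacent floats starting at element `index` as one unit.
// Assumption: `index` is a multiple of 4 and `base` is 16-byte aligned.
inline Float4 load_f32x4(const float* base, std::size_t index) {
    assert(index % 4 == 0);
    assert(is_aligned_16(base + index));
    Float4 v;
    std::memcpy(&v, base + index, sizeof(Float4));
    return v;
}

// Writes four adjacent floats starting at element `index` as one unit.
inline void store_f32x4(float* base, std::size_t index, const Float4& v) {
    assert(index % 4 == 0);
    assert(is_aligned_16(base + index));
    std::memcpy(base + index, &v, sizeof(Float4));
}
```

With a 16-byte-aligned buffer, element offsets 0, 4, 8, and so on satisfy the check, while an offset such as 1 does not. In that case a kernel would have to fall back to scalar accesses.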

LostBeard self-assigned this on May 10, 2026
@LostBeard
Owner

Awesome! I'll take a look asap.

@ilehtoranta
Author

Is there anything I could help with? I mean, adding more tests, for example?

LostBeard added a commit that referenced this pull request May 22, 2026
- Add PTXMemory class (ILGPU.Algorithms.PTX) with ld.v2/v4.f32 and st.v2/v4.f32 intrinsics; Float2/Float4 structs
- Add ArrayView LoadVectorized/StoreVectorized/CastAligned extension helpers
- Revert CudaAccelerator.DefaultMaxRegistersPerThread default from 255 to 0 (restores occupancy on normal kernels)
- Remap System.Numerics.BitOperations to hardware-backed IntrinsicMath methods (CLZ/PopC/CTZ)
- Add CUDA-only unit tests for all new PTX vector memory variants
- Bump ILGPU/ILGPU.Algorithms fork to 2.0.7; SpawnDev.ILGPU to 4.9.6-local.1

Addresses ilehtoranta Discussion #5 and PR #4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@LostBeard
Owner

Merged in commit 2ec94d6, shipping in 4.9.6 (currently published to my local feed as 4.9.6-local.1; rc to nuget.org shortly).

Applied changes:

  • PTXMemory class with LoadF32x2, LoadF32x4, StoreF32x2, StoreF32x4 (ref + struct + scalar-argument forms)
  • Float2 and Float4 readonly structs
  • PTXContext.cs RegisterMemoryIntrinsics() + per-intrinsic registration helper
  • PTX code generators: GenerateLoad, GenerateStore, GenerateStoreScalars
  • PTXContext.Generated.tt + .cs wired up
  • ArrayView<T>.LoadVectorized/StoreVectorized/CastAligned extension helpers + ArrayView1D<T,Dense> overloads

The RemappedIntrinsics.cs change, which retargets System.Numerics.BitOperations at the hardware-backed IntrinsicMath methods (with [MathIntrinsic(CLZ/PopC/CTZ)]), is a real cross-backend win. It now emits popc/clz/ctz (PTX), popcount/clz/ctz (OpenCL), countOneBits/countLeadingZeros/countTrailingZeros (WebGPU), i32.popcnt/clz/ctz (Wasm), and a parallel-bit-count path on WebGL, instead of compiling the C# software fallback on every backend. Nice catch.
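
For a sense of what a software fallback costs compared with a single hardware instruction, here is a rough C++ sketch of bit-twiddling versions of the three operations. It is not the actual .NET or ILGPU implementation: the names `popcount_sw`, `clz_sw`, and `ctz_sw` are hypothetical, and returning 32 for a zero input is an assumption made for this example.

```cpp
#include <cstdint>

// Software population count (SWAR bit-twiddling): about a dozen ALU operations.
inline int popcount_sw(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return static_cast<int>((x * 0x01010101u) >> 24);
}

// Software count of leading zeros via binary search over the bit width.
// Assumption for this sketch: returns 32 when x == 0.
inline int clz_sw(std::uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}

// Software count of trailing zeros: isolate the lowest set bit,
// then count the bits below it. Returns 32 when x == 0 (assumption).
inline int ctz_sw(std::uint32_t x) {
    if (x == 0) return 32;
    std::uint32_t lowest = x & (~x + 1u);
    return popcount_sw(lowest - 1u);
}
```

Each of these compiles to a sequence of shifts, masks, and branches, whereas a backend that recognizes the operation can emit one instruction per call.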

Added CUDA-only unit tests covering all five variants (ld.v2.f32, ld.v4.f32, st.v2.f32 from struct, st.v2/v4.f32 from scalar args, and the LoadVectorized<Float2> ArrayView path) — all passing under PMT (Tests_PTXVectorMemory_F32x2_LoadStore, Tests_PTXVectorMemory_F32x4_LoadStore, Tests_PTXVectorMemory_F32x2_StoreScalars, Tests_PTXVectorMemory_F32x4_StoreScalars, Tests_ArrayView_LoadVectorized_Float2).

Closing as manually applied. Thank you for the well-structured contribution — the PTX code generators followed the existing ILGPU pattern exactly, easy to drop in.

LostBeard closed this on May 22, 2026
@ilehtoranta
Author

Thanks! The AI made this easy =)
